Getting Set-up

Installing and Setting up RStudio

The R console looks like this:

File Organization

Make sure that you set up a folder for this class.

Using RMarkdown/knitr

You can knit the file. The first time you do this you will need to make sure you have the knitr package installed. You have the option to knit into HTML (.html), PDF (.pdf), or Word (.docx). In general, in this course we will be knitting into .html.

RMarkdown formatting

To make something “code-looking” we wrap it in grave accents (backticks) `like this`; the ` key is found in the upper left of your keyboard.

To create a header, place a hash (#) at the start of the line, followed by a space. For example, # Header 1 or create a level 2 header using ## Header Level 2.

To make text italic, put an asterisk on each side of the text *like this*. To make text bold, put two asterisks on each side of the text **like this**.

To make a list, just start creating your list using a - or * for each bullet, like this:

- list item 1
- list item 2

It is important that there is a blank line before the first bullet.

Add a link with the following code:

[Link text that will display](https://www.google.com)

It will display like this:

Link text that will display

Add an image with the following code:

![Alt text](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/rmarkdown_wizards.png)

It will display like this:

Alt text

Most Markdown syntax is covered in the RStudio RMarkdown Cheatsheet, Section 3.

R Chunks

Create an R chunk by placing code between a line containing ```{r} and a line containing ```:

2+2
## [1] 4

OR

x<-4

echo=T or echo=F – determines whether or not to echo the source code in the output file. This can be useful if you are creating a document for someone who doesn’t need or doesn’t want to see your code, just the output. In general, for assignments in this course I would like your code to be echoed. The default is echo=T.

results='markup' or results='hide' – determines whether or not the text results will be displayed (results=F also works as shorthand for 'hide'). This can be useful if you want to show code, but don’t care what the output is. The default is results='markup'.

eval=T or eval=F – determines whether or not to evaluate the code. This can be useful if you have a whole chunk of code you don’t want run, but you also don’t want to delete. The default is eval=T.

There are many, many more options including fig.width, fig.height, cache, etc. The vast majority of options are available in the RStudio RMarkdown Cheatsheet, Section 5.

You have the option to set the options individually on each chunk and/or set the global options by using the code knitr::opts_chunk$set(your options here) in the first code chunk.
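As a minimal sketch of setting global options, the block below sets a few of the options named above and reads one back. The specific values (such as a figure width of 6) are illustrative assumptions, not course requirements.

```r
# Typically placed in the first chunk of an RMarkdown document.
# The guard only matters if this is run outside of knitting
# on a machine where knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = TRUE,       # show the source code
    eval = TRUE,       # run the code
    fig.width = 6,     # figure width in inches (illustrative value)
    fig.height = 4     # figure height in inches (illustrative value)
  )

  # Read back one of the options that was just set
  knitr::opts_chunk$get("fig.width")
}
```

Options given on an individual chunk take precedence over these global settings for that chunk.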

Inline Code

Rather than using a code chunk (which is set apart as its own block on the page), you also have the option to use inline code. You can place the following within any sentence or paragraph.

`r codehere`

For example,

This is the number `r x`.

becomes… This is the number 4.

Installing Packages

Packages can contain lots of things including: data sets, functions, etc.

You can install packages using the packages tab or you can use the code install.packages('packageyouwant') in the console.

In each new R session where you want to use the package you will have to load it by typing library('packageyouwant') in the console (or in the RMarkdown document - more later).

To get help with a function in a package you can type ?functionname into the console. To see an overview of a whole package, type help(package = 'packagename').
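One common pattern, sketched here, is to install a package only when it is missing and then load it. The package name `stats` is just a stand-in that ships with R; swap in the package you actually want.

```r
pkg <- "stats"  # stand-in name; replace with the package you want

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load it for this session; character.only = TRUE lets
# library() accept a name stored in a variable
library(pkg, character.only = TRUE)

# Confirm that the package is attached to the current session
paste0("package:", pkg) %in% search()
```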

Additional Reading (Optional)

Some Basic R code

Variables, Calculations, Vectors

Assigning Variables:

x<-56

Calculations:

y <- x*2 #multiply
          # note that because value is assigned to y, it won't print out
y #prints out the value of y
## [1] 112
x/2 #divide
## [1] 28
x^2 #x to the power of 2
## [1] 3136

Vectors:

# c() function: concatenate
heights <- c(67, 100, 34, 78, 80)

Referencing Elements of a Vector:

heights[3]
## [1] 34

Adding to Vectors:

heights <- c(heights, 90)
heights
## [1]  67 100  34  78  80  90
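A few more vector operations follow the same pattern. This sketch reuses the `heights` values from above; the cutoff of 79 is arbitrary.

```r
heights <- c(67, 100, 34, 78, 80, 90)

# Arithmetic is applied to every element at once
half_heights <- heights / 2

# Number of elements in the vector
n <- length(heights)

# Reference several elements by position
first_and_third <- heights[c(1, 3)]

# Keep only the elements that satisfy a condition
tall <- heights[heights > 79]
tall
```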

Importing Data

From a .csv file (read.csv() accepts either a path to a file on your computer or a URL):

airbnb <- read.csv("http://ebmwhite.github.io/MATH0216/data/NYCairbnb2019.csv")

From a package:

library("openintro")
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
data(cars)

For now, we will mostly be working with .csv and .xls files. Later in the course, we may discuss other types of files.
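To try reading a file from your own computer without downloading anything, the sketch below writes a small made-up data frame to a temporary .csv file and reads it back. The data frame `toy` and the file name `toy_listings.csv` are hypothetical.

```r
# A small made-up data frame (not part of the course data)
toy <- data.frame(id = 1:3, price = c(100, 250, 80))

# Write it to a temporary location on this computer
path <- file.path(tempdir(), "toy_listings.csv")
write.csv(toy, path, row.names = FALSE)

# Read it back in, just as you would any local .csv file
from_file <- read.csv(path)
dim(from_file)  # rows, then columns
```

For your own files, replace `path` with the location of the file inside your class folder.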

Basics for Working with a Dataframe

Assessing Size:

# dim() spits out dimensions of a dataframe
dim(airbnb)
## [1] 48895    16

Names:

# names() spits out column names of a dataframe
names(airbnb)

Referencing Columns:

airbnb$latitude
airbnb[,3]
airbnb[,"latitude"]


attach(airbnb)
latitude
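Because attach() makes every column available by its bare name, it can cause confusion when a column shares a name with another object. One alternative is with(), sketched here on a small made-up data frame (`toy` and its values are hypothetical).

```r
# A small made-up data frame with two of the same column names
toy <- data.frame(
  latitude = c(40.69, 40.79, 40.77),
  price = c(120, 95, 60)
)

# with() evaluates an expression using the columns of the data frame,
# without attaching it for the rest of the session
mean_price <- with(toy, mean(price))
mean_price

# The same three ways of referencing a column give the same values
identical(toy$price, toy[, 2])
identical(toy$price, toy[, "price"])
```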

Calculations:

mean(airbnb$price)
## [1] 152.7207
median(airbnb$price)
## [1] 106
sd(airbnb$price) #standard deviation
## [1] 240.1542
# calculates the mean price, broken down by neighbourhood group
tapply(airbnb$price, airbnb$neighbourhood_group, mean)
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##      87.49679     124.38321     196.87581      99.51765     114.81233
#calculates the mean price, broken down by room type
tapply(airbnb$price, airbnb$room_type, mean)
## Entire home/apt    Private room     Shared room 
##       211.79425        89.78097        70.12759
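To see what tapply() is doing, it can help to run it on data small enough to check by hand. The vectors `price` and `group` below are made up for illustration.

```r
# Four made-up prices, each belonging to group A or group B
price <- c(100, 200, 50, 150)
group <- c("A", "B", "A", "B")

# Mean price within each group: A is (100 + 50) / 2, B is (200 + 150) / 2
means <- tapply(price, group, mean)
means

# Any function that summarizes a vector can be used, such as length()
counts <- tapply(price, group, length)
counts
```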

Conditional Subsetting:

# prints out all the rows where the price is at least 8000
airbnb[airbnb$price >= 8000,]
##             id                                               name  host_id
## 4378   2953058                                      Film Location  1177497
## 6531   4737930                                 Spanish Harlem Apt  1235070
## 9152   7003697                Furnished room in Astoria apartment 20582832
## 12343  9528920                Quiet, Clean, Lit @ LES & Chinatown  3906464
## 17693 13894339    Luxury 1 bedroom apt. -stunning Manhattan views  5143901
## 29239 22436899                                1-BR Lincoln Center 72390391
## 30269 23377410  Beautiful/Spacious 1 bed luxury flat-TriBeCa/Soho 18128455
## 40434 31340283 2br - The Heart of NYC: Manhattans Lower East Side  4382127
##       host_name neighbourhood_group   neighbourhood latitude longitude
## 4378    Jessica            Brooklyn    Clinton Hill 40.69137 -73.96723
## 6531      Olson           Manhattan     East Harlem 40.79264 -73.93898
## 9152   Kathrine              Queens         Astoria 40.76810 -73.91651
## 12343       Amy           Manhattan Lower East Side 40.71355 -73.98507
## 17693      Erin            Brooklyn      Greenpoint 40.73260 -73.95739
## 29239    Jelena           Manhattan Upper West Side 40.77213 -73.98665
## 30269       Rum           Manhattan         Tribeca 40.72197 -74.00633
## 40434      Matt           Manhattan Lower East Side 40.71980 -73.98566
##             room_type price minimum_nights number_of_reviews last_review
## 4378  Entire home/apt  8000              1                 1  2016-09-15
## 6531  Entire home/apt  9999              5                 1  2015-01-02
## 9152     Private room 10000            100                 2  2016-02-13
## 12343    Private room  9999             99                 6  2016-01-01
## 17693 Entire home/apt 10000              5                 5  2017-07-27
## 29239 Entire home/apt 10000             30                 0            
## 30269 Entire home/apt  8500             30                 2  2018-09-18
## 40434 Entire home/apt  9999             30                 0            
##       reviews_per_month calculated_host_listings_count availability_365
## 4378               0.03                             11              365
## 6531               0.02                              1                0
## 9152               0.04                              1                0
## 12343              0.14                              1               83
## 17693              0.16                              1                0
## 29239                NA                              1               83
## 30269              0.18                              1              251
## 40434                NA                              1              365
# prints out all the rows where the neighbourhood group is Manhattan
# note the double equal sign
airbnb[airbnb$neighbourhood_group=="Manhattan",]
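Conditions can also be combined. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `borough` and `price`) so the results are easy to check by eye.

```r
toy <- data.frame(
  borough = c("Manhattan", "Brooklyn", "Manhattan", "Queens"),
  price = c(300, 120, 90, 75)
)

# & means "and": both conditions must be TRUE for a row to be kept
pricey_manhattan <- toy[toy$borough == "Manhattan" & toy$price > 100, ]
pricey_manhattan

# | means "or": a row is kept if either condition is TRUE
cheap_or_queens <- toy[toy$price < 100 | toy$borough == "Queens", ]
cheap_or_queens

# subset() is another way to write the same kind of filter
over_100 <- subset(toy, price > 100)
over_100
```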

Basic Plotting:

hist(airbnb$price)

plot(airbnb$reviews_per_month, airbnb$price)

Best Practices

Commenting

  • Be sure to comment your code (in R, use a # before a line of comment)
  • The more descriptive you can be, the easier it will be for others to read (and for you to read later)

Naming

When naming variables, observations, data frames, or files, make them:

  • meaningful
  • consistent
  • concise
  • code and coder friendly

Other naming considerations:

  • avoid names that are already used as function names (e.g. filter or mean)
  • consider making object names nouns, and function names verbs
  • it’s not the end of the world if you give something a bad name, but a good name will save you (and others) time and effort down the road
  • avoid formatting and symbols (e.g. spaces or &)
  • keep a clear record of your variable names as well as longer descriptions including units (e.g. surface_temp = surface temperature measurement on Mars in degrees Celsius)

Entering Things

Some suggestions for best practices:

  • be consistent (e.g. always purple, not a mix of purple, Purple, and purple_)
  • put any additional information such as units or notes in a column separate from the value
  • if there are missing entries, enter the same thing for each missing value (it is common to use NA, NaN, -9999, -); don’t leave cells blank
  • if data is abbreviated, make a record somewhere of what the abbreviations mean
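If inconsistent entries have already made it into a data set, they can often be tidied in R. This is a rough sketch on a made-up vector (`colors` is hypothetical); real data may need different rules.

```r
# The same value entered four different ways
colors <- c("purple", "Purple", "purple_", " PURPLE")

# Remove surrounding spaces, then convert to lower case
clean <- tolower(trimws(colors))

# Remove a trailing underscore, if there is one
clean <- gsub("_$", "", clean)

# All four entries now match
unique(clean)
```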

Example

[Figures: bad data entry and good data entry, by @alisonhorst]

The Basics of Working with Missing Data

Missing data are usually recorded as NA, NaN, N/A, or -9999. R itself treats only NA (and NaN) as missing. When doing calculations on numbers, most functions will return NA if the data includes missing values.

mean(airbnb$reviews_per_month)
## [1] NA
# use argument na.rm to remove NAs
mean(airbnb$reviews_per_month, na.rm=T)
## [1] 1.373221
#OR 

# use function na.omit() to return a vector without NAs, then take the mean
mean(na.omit(airbnb$reviews_per_month))
## [1] 1.373221
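When missing values were entered as a placeholder such as -9999, R will treat them as ordinary numbers unless they are converted to NA first. A minimal sketch, using a made-up vector `reviews`:

```r
# Made-up data: one true NA and one -9999 placeholder
reviews <- c(0.5, NA, 2.0, -9999, 1.5)

# Convert the placeholder to NA; %in% is used because it never returns NA
reviews[reviews %in% -9999] <- NA

# Count how many values are missing
n_missing <- sum(is.na(reviews))
n_missing

# The mean of the remaining values: (0.5 + 2.0 + 1.5) / 3
mean_reviews <- mean(reviews, na.rm = TRUE)
mean_reviews
```

When importing, the `na.strings` argument of read.csv() can do the same conversion at read time.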

Working with Factors

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and plotting.

#to check whether something is a factor
is.factor(cars$type)
## [1] TRUE
# to make something a factor
cars$type <- factor(cars$type)

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R sorts levels in alphabetical order.

#to see the levels
levels(cars$type)
## [1] "large"   "midsize" "small"
#to see the number of levels
nlevels(cars$type)
## [1] 3
#to see how many are in each level
table(cars$type)
## 
##   large midsize   small 
##      11      22      21
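Alphabetical order is not always the order you want (here, "large" comes before "small"). The sketch below sets the level order explicitly on a made-up vector `sizes`; the same `levels` argument could be used with a column such as cars$type.

```r
# Made-up values using the same three categories
sizes <- c("small", "large", "midsize", "small")

# Default: levels are sorted alphabetically
default_f <- factor(sizes)
levels(default_f)

# Specify the order of the levels, and mark the factor as ordered
ordered_f <- factor(
  sizes,
  levels = c("small", "midsize", "large"),
  ordered = TRUE
)
levels(ordered_f)

# Counts are now reported in the chosen order
size_counts <- table(ordered_f)
size_counts

# Ordered factors can be compared: is the first entry smaller than the second?
ordered_f[1] < ordered_f[2]
```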